Modular content parser: YouTube + Instagram + Reddit #1

Draft

codeby wants to merge 33 commits into main from claude/cloud-setup-4v9hy


Conversation


codeby (Owner) commented Apr 28, 2026

Summary

  • Refactored YouTube parser into a modular plugin system (core + plugins)
  • Added Instagram plugin via Apify's instagram-scraper (Reels, posts, hashtags, accounts)
  • Added Reddit plugin via PRAW read-only (subreddits, search, posts, users)
  • Generic Streamlit UI auto-renders forms from each plugin's input/settings specs
  • Unified CLI: python -m content_parser.cli {list-sources, run --source ...}
  • Old youtube_parser/ keeps working via re-export shims; the legacy CLI is now a thin translation layer over the new one
  • 102 unit tests (stdlib unittest, no extra deps)

What's where

content_parser/
├── core/
│   ├── schema.py        # Item / Comment / Transcript dataclasses
│   ├── plugin.py        # SourcePlugin ABC + InputSpec / FieldSpec
│   ├── registry.py      # plugin discovery (logs real bugs, silent on missing optional deps)
│   ├── runner.py        # resolve → fetch → write (try/finally for partial-run safety)
│   ├── output.py        # Item → JSON / Markdown / CSV / index (path-traversal-safe)
│   ├── secrets.py       # st.secrets → env → ~/.content_parser/config.json + .streamlit/secrets.toml
│   └── errors.py
├── plugins/
│   ├── youtube/         # ported existing logic + adapter to Item
│   ├── instagram/       # Apify HTTP client (Bearer auth) + adapter + plugin
│   └── reddit/          # PRAW client + adapter + plugin
├── ui/app.py            # Streamlit UI with dynamic forms
└── cli.py               # unified CLI entry

Setup for Streamlit Cloud

Add to Settings → Secrets:

YOUTUBE_API_KEY = "AIza..."           # for YouTube plugin
APIFY_API_TOKEN = "apify_api_..."     # for Instagram plugin
REDDIT_CLIENT_ID = "xxx"              # for Reddit plugin
REDDIT_CLIENT_SECRET = "yyy"          # for Reddit plugin
REDDIT_USER_AGENT = "myapp:v1 by /u/me"  # optional, recommended

# Optional, only if YouTube transcripts get blocked from a datacenter IP:
WEBSHARE_USERNAME = "..."
WEBSHARE_PASSWORD = "..."

Security highlights (covered by tests)

  • _safe_filename applied to source and item_id (defense in depth against malicious upstream IDs); _file_stem appends a short hash when sanitization collapses the id, so collisions can't clobber files (see the sketch after this list)
  • Apify token sent in Authorization: Bearer … header (not query string) so it doesn't leak into nginx logs
  • TOML upsert escapes \ and " so secrets containing these characters round-trip cleanly
  • Reddit URL host validation is exact-match (.reddit.com / .redd.it) — evilreddit.com rejected
  • _redact_spec strips both query and fragment from URLs before they enter exception messages or logs
  • Streamlit secret fields are type="password"; ~/.content_parser/config.json and .streamlit/secrets.toml are written chmod 600
  • output/ is in .gitignore (contains scraped comments, possibly PII)
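
Roughly the shape of the filename hardening described above (a sketch; names and the exact character set are illustrative, not the shipped code):

  import hashlib
  import re

  _SAFE = re.compile(r"[^A-Za-z0-9._-]+")

  def _safe_filename(value: str, fallback: str = "item") -> str:
      # Collapse anything outside a conservative character set, strip leading dots.
      cleaned = _SAFE.sub("_", value).strip("._")
      return cleaned or fallback

  def _file_stem(source: str, item_id: str) -> str:
      safe_id = _safe_filename(item_id)
      if safe_id == "item":
          # Sanitization collapsed the id entirely: append a short hash of the
          # original so two hostile ids can't clobber the same file on disk.
          safe_id = "item_" + hashlib.sha256(item_id.encode("utf-8")).hexdigest()[:8]
      return f"{_safe_filename(source)}_{safe_id}"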

Roadmap (not in this PR)

  • Stage B: external input loaders (CSV / Google Sheets / YAML)
  • Stage C: cron scheduler (cli jobs install-cron)
  • Optional: Whisper transcription of Reels

Test plan

  • pip install -r requirements.txt succeeds
  • python -m content_parser.cli list-sources → youtube, instagram, reddit
  • python -m unittest discover -s tests → 102 passed
  • YouTube: python -m content_parser.cli run --source youtube --video https://youtu.be/...
  • Instagram (needs APIFY_API_TOKEN): python -m content_parser.cli run --source instagram --account nasa --set max_posts_per_input=5
  • Reddit (needs REDDIT_CLIENT_ID/SECRET): python -m content_parser.cli run --source reddit --subreddit python --set listing=top --set time_filter=week
  • Back-compat: python -m youtube_parser.main --video URL --max-comments 10 still works
  • streamlit run app.py shows source selector with all 3 plugins; tabs render dynamically
  • Streamlit Cloud picks up the new UI on next deploy

https://claude.ai/code/session_01XhN8Fp3HF1K4PzwS3oofta

claude and others added 11 commits April 28, 2026 19:24
CLI tool that resolves search queries, channels, playlists, or video URLs
to a list of videos, then fetches top-level comments (optionally with
replies) via the YouTube Data API and transcripts via youtube-transcript-api.
Writes per-video JSON + Markdown plus a summary CSV and index.

https://claude.ai/code/session_01XhN8Fp3HF1K4PzwS3oofta
Browser-based form wraps the existing parser modules: queries, channels,
playlists, and videos as separate tabs; sidebar holds API key and limits;
runs stream live status into the page; results are downloadable as a
single ZIP or as summary.csv.

https://claude.ai/code/session_01XhN8Fp3HF1K4PzwS3oofta
Falls back to the YOUTUBE_API_KEY environment variable for local runs.
Wraps st.secrets access in try/except so a missing secrets.toml does not
crash the app locally.

https://claude.ai/code/session_01XhN8Fp3HF1K4PzwS3oofta
The Streamlit app now reads in Russian end-to-end. Added Save / Delete
buttons next to the API key field that write the key to
~/.youtube_parser_config.json (chmod 600). Loading order on startup:
st.secrets → $YOUTUBE_API_KEY → saved file. .gitignore added to keep
caches, virtualenvs, the secrets file, and parser output out of git.

https://claude.ai/code/session_01XhN8Fp3HF1K4PzwS3oofta
The Save button now writes the key to both
~/.youtube_parser_config.json and .streamlit/secrets.toml so it is
available globally and via st.secrets in the same Streamlit project.
The TOML upsert preserves any other keys in the file and deletes the
file if removing the key leaves it empty. Delete clears both locations.
The status caption lists every place the key is saved.

https://claude.ai/code/session_01XhN8Fp3HF1K4PzwS3oofta
The 1.x release replaced the static YouTubeTranscriptApi.list_transcripts
class method with an instance method (api.list / api.fetch). The old code
silently failed for every video because the broad except returned None
on the AttributeError, so the UI always reported "no transcript".

Rewrite transcripts.py against the new API and switch to a verbose
return shape so callers can distinguish disabled, missing, and blocked
cases. Both the Streamlit app and the CLI now report the actual reason
when no transcript is produced. Pinned the dependency to >=1.0.0.
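
A sketch of the 1.x call shape the rewrite targets (exception names as exported by youtube-transcript-api; the real transcripts.py returns a richer status object):

  from youtube_transcript_api import (
      YouTubeTranscriptApi,
      TranscriptsDisabled,
      NoTranscriptFound,
  )

  def fetch_transcript(video_id: str) -> dict:
      api = YouTubeTranscriptApi()        # 1.x: instance methods, not the removed statics
      try:
          fetched = api.fetch(video_id)
      except TranscriptsDisabled:
          return {"status": "disabled", "segments": []}
      except NoTranscriptFound:
          return {"status": "missing", "segments": []}
      segments = [{"text": s.text, "start": s.start, "duration": s.duration} for s in fetched]
      return {"status": "ok", "segments": segments}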

https://claude.ai/code/session_01XhN8Fp3HF1K4PzwS3oofta
YouTube blocks transcript requests from datacenter IPs (Streamlit Cloud,
GCP, AWS), surfacing as RequestBlocked. Add a proxy_config kwarg to the
transcripts module and a sidebar section in the Streamlit app to choose
Webshare (rotating residential proxies) or a generic HTTP proxy.

Defaults are pulled from st.secrets (WEBSHARE_USERNAME, WEBSHARE_PASSWORD,
PROXY_HTTP_URL, PROXY_HTTPS_URL) or environment variables, so creds set
in the Streamlit Cloud dashboard load automatically.
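
Assuming the proxy classes youtube-transcript-api 1.x ships, the wiring looks roughly like this (env names match the secrets above; not the exact shipped code):

  import os
  from youtube_transcript_api import YouTubeTranscriptApi
  from youtube_transcript_api.proxies import GenericProxyConfig, WebshareProxyConfig

  def build_transcript_api() -> YouTubeTranscriptApi:
      user = os.getenv("WEBSHARE_USERNAME")
      password = os.getenv("WEBSHARE_PASSWORD")
      if user and password:
          proxy = WebshareProxyConfig(proxy_username=user, proxy_password=password)
      elif os.getenv("PROXY_HTTP_URL"):
          proxy = GenericProxyConfig(
              http_url=os.getenv("PROXY_HTTP_URL"),
              https_url=os.getenv("PROXY_HTTPS_URL"),
          )
      else:
          proxy = None
      return YouTubeTranscriptApi(proxy_config=proxy)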

https://claude.ai/code/session_01XhN8Fp3HF1K4PzwS3oofta
Adds a source-agnostic core (schema.py with Item/Comment/Transcript
dataclasses, plugin.py with the SourcePlugin ABC plus InputSpec/FieldSpec,
registry.py, runner.py, secrets.py, output.py, errors.py) so additional
sources can plug in alongside YouTube without touching the core.

The existing YouTube modules move into content_parser/plugins/youtube/
with an adapter that converts API dicts into the new Item schema and a
YouTubePlugin implementing the contract. The youtube_parser/sources.py,
comments.py, and transcripts.py become one-line shims that re-export
from the new location, so existing callers (app.py, youtube_parser.main)
keep working unchanged.
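
The plugin contract, reduced to a sketch (field names are illustrative; the real specs carry more metadata):

  from abc import ABC, abstractmethod
  from dataclasses import dataclass, field
  from typing import Any, Iterable

  @dataclass
  class FieldSpec:
      key: str
      label: str
      kind: str = "text"               # text / int / bool / select, rendered by the generic UI
      default: Any = None
      options: list[str] = field(default_factory=list)

  @dataclass
  class InputSpec:
      kind: str                        # e.g. "query", "channel", "video"
      label: str
      placeholder: str = ""

  class SourcePlugin(ABC):
      name: str

      @abstractmethod
      def input_specs(self) -> list[InputSpec]: ...

      @abstractmethod
      def settings_specs(self) -> list[FieldSpec]: ...

      @abstractmethod
      def resolve(self, inputs: dict[str, list[str]], settings: dict) -> list[Any]: ...

      @abstractmethod
      def fetch(self, resolved: list[Any], settings: dict, secrets: dict) -> Iterable["Item"]: ...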

https://claude.ai/code/session_01XhN8Fp3HF1K4PzwS3oofta
content_parser.cli exposes 'list-sources' and 'run --source ... --input
KIND=VALUE --set KEY=VALUE'. Convenience aliases (--query, --channel,
--video, --hashtag, --account, --post) and key=value setting overrides
make scripted runs ergonomic.

content_parser/ui/app.py renders the Streamlit interface from each
plugin's input_specs() and settings_specs(), so adding a new source
needs no UI changes. Sidebar manages secrets per plugin (load from
st.secrets/env/config.json, save/clear buttons), the proxy block shows
only when the active plugin has a proxy_provider setting.

Root app.py is now a 3-line shim into content_parser.ui.app.main, so
Streamlit Cloud picks up the new UI on next deploy. The legacy
youtube_parser.main CLI keeps working unchanged via the back-compat
shims introduced in the previous commit.

https://claude.ai/code/session_01XhN8Fp3HF1K4PzwS3oofta
InstagramPlugin handles three input kinds — hashtags, accounts, and
direct post/reel URLs — and runs them in a single Apify actor call. The
adapter maps Apify post fields (likesCount, videoViewCount, musicInfo,
latestComments with nested replies) into the unified Item schema, with
audio_id and audio_title surfaced under media for trend research.

ApifyClient is a thin wrapper around run-sync-get-dataset-items with
explicit handling of 401 (bad token) and 402 (out of credits). The
plugin auto-registers via content_parser.core.registry, so the CLI and
Streamlit UI pick it up without further changes — confirmed via
'python -m content_parser.cli list-sources'.

Adds requests>=2.31.0 to requirements.
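
The client is essentially one call to Apify's run-sync-get-dataset-items endpoint; a sketch with assumed error-class imports:

  import requests
  from content_parser.core.errors import AuthError, PluginError  # assumed import path

  def run_actor(actor_id: str, actor_input: dict, token: str, timeout: int = 300) -> list[dict]:
      # Apify addresses actors as "user~name" in URL paths; token goes in the header, never the URL.
      url = f"https://api.apify.com/v2/acts/{actor_id.replace('/', '~')}/run-sync-get-dataset-items"
      resp = requests.post(
          url,
          json=actor_input,
          headers={"Authorization": f"Bearer {token}"},
          timeout=timeout,
      )
      if resp.status_code == 401:
          raise AuthError("Apify token rejected (401)")
      if resp.status_code == 402:
          raise PluginError("Apify account out of credits (402)")
      resp.raise_for_status()
      return resp.json()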

https://claude.ai/code/session_01XhN8Fp3HF1K4PzwS3oofta
codeby changed the title from "Add YouTube parser for comments and transcripts" to "Modular content parser: YouTube + Instagram plugins" on Apr 29, 2026
claude added 6 commits April 29, 2026 07:34
- registry.py now distinguishes ImportError (optional dep missing — silent
  at DEBUG) from any other exception (typo, runtime bug — printed to
  stderr) so plugins no longer disappear without explanation.
- runner.py wraps the fetch loop in try/finally; summary.csv and index.md
  are flushed even when fetch raises mid-iteration, so partial runs stay
  inspectable. The original exception is re-raised after.
- secrets.py escapes backslashes and double quotes when writing values to
  .streamlit/secrets.toml, so a value containing a quote no longer
  produces a malformed TOML file that breaks st.secrets on next start.

Verified with a mini test harness: TOML round-trips a value like
'a"b\\c' through tomllib, the runner produces summary.csv after a forced
mid-loop crash, and the registry warns on a NameError while staying silent
on a missing optional import.
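
The registry distinction is a small try/except split; roughly (plugin module list and the PLUGIN attribute are an illustrative convention):

  import importlib
  import logging
  import sys

  log = logging.getLogger(__name__)

  def discover(module_names: list[str]) -> dict:
      plugins = {}
      for name in module_names:
          try:
              module = importlib.import_module(name)
          except ImportError as exc:
              log.debug("plugin %s skipped (optional dependency missing): %s", name, exc)
              continue
          except Exception as exc:
              # Typo / runtime bug in the plugin itself: make it loud instead of vanishing.
              print(f"plugin {name} failed to load: {exc!r}", file=sys.stderr)
              continue
          plugin = module.PLUGIN       # hypothetical registration convention
          plugins[plugin.name] = plugin
      return plugins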

https://claude.ai/code/session_01XhN8Fp3HF1K4PzwS3oofta
…der auth

- Routing per input kind: hashtags + accounts go to one Apify call with
  the user-chosen resultsType ('posts' by default). Explicit post/reel
  URLs go to a second call with resultsType='details', since 'posts' on
  a single-post URL returns nothing useful. The runner sees this as one
  fetch generator yielding all results combined.
- _normalize_account refuses URLs whose first path segment is /p/, /reel/,
  /explore/, etc. — those used to silently turn into a request for a
  username like 'p', returning empty data with no clear error. Also
  validates username characters against Instagram's allowed set.
- resolve() raises a PluginError if a value in the post_url field doesn't
  look like /p/ or /reel/, so users catch the mistake before paying for
  a useless Apify run.
- ApifyClient sends the token in the Authorization: Bearer header instead
  of as a ?token= query string, so it doesn't leak into nginx access logs.

https://claude.ai/code/session_01XhN8Fp3HF1K4PzwS3oofta
- youtube_parser/main.py is now a translation layer over content_parser.cli:
  it parses the original argument set ('--query', '--video', '--max-comments',
  '--include-replies', '--no-transcripts', etc.) and rewrites it into the new
  '--source youtube --set key=value' form. Removes ~150 lines of duplicated
  CLI logic that drifted away from the new output layout.
- ui/app.py _render_field now handles a 'select' widget with no options
  and no default by falling back to a free-text input, so a misconfigured
  FieldSpec doesn't crash the whole UI.
- .gitignore picks up .content_parser/ (saved-secrets dir) and
  .pytest_cache/.
- tests/ adds 34 unittest cases (no extra dependency, runs with stdlib):
  TOML upsert/escape/round-trip, runner partial-run safety, Instagram
  account validation + per-kind dispatch + Apify Bearer auth, Apify
  adapter field mapping, legacy CLI flag translation. Runs via
  'python -m unittest discover -s tests'.

https://claude.ai/code/session_01XhN8Fp3HF1K4PzwS3oofta
Four input kinds:
- subreddit (name or URL, with or without 'r/' prefix)
- query (full-text search across all of Reddit)
- post_url (specific thread for comment analysis)
- user (posts by a given Redditor — competitor tracking)

Settings cover the listing knobs (hot/top/new/rising/controversial),
time_filter for top/controversial, max posts per input, comment
collection (top-level only by default, mirroring the YouTube plugin),
and an opt-in expand_more_comments flag for users who want the full
tree at the cost of slower scrapes.

The adapter maps PRAW Submission/Comment objects into the unified Item
schema: score / upvote_ratio / num_comments / NSFW + locked / spoiler
flags / external link domain go into media; awards and post_hint go
into extra. Deleted authors render as "[deleted]" rather than None.
Comments are flattened with parent_id linkage so the same Markdown
renderer that handles YouTube replies works unchanged.

Secrets needed: REDDIT_CLIENT_ID + REDDIT_CLIENT_SECRET (free, created
at reddit.com/prefs/apps as a "script" app). REDDIT_USER_AGENT is
optional with a sensible default.

Adds 41 new tests (75 total) covering adapter field mapping, input
normalization (subreddit/user prefixes, URL parsing), reject paths
(invalid chars, listing URL in post_url field), comment depth + cap
behavior, and PRAW listing dispatch via mocks. praw>=7.7 added to
requirements.txt.
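
For orientation, the read-only PRAW setup the plugin wraps looks roughly like this (listing and comment handling simplified):

  import os
  import praw

  reddit = praw.Reddit(
      client_id=os.environ["REDDIT_CLIENT_ID"],
      client_secret=os.environ["REDDIT_CLIENT_SECRET"],
      user_agent=os.getenv("REDDIT_USER_AGENT", "content_parser:v1 (read-only)"),
  )

  for submission in reddit.subreddit("python").top(time_filter="week", limit=5):
      submission.comments.replace_more(limit=0)          # top-level only, like the plugin default
      for comment in submission.comments.list()[:10]:
          author = comment.author or "[deleted]"
          print(submission.id, author, comment.body[:60])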

https://claude.ai/code/session_01XhN8Fp3HF1K4PzwS3oofta
- _file_stem now passes source and item_id through _safe_filename, not
  just title. Defense in depth against an upstream API returning a
  malicious id like '../../etc/passwd' that would have escaped the
  output directory. Verified by tests that hit write_item_json /
  write_item_markdown with traversal attempts and assert the resulting
  path stays under out_dir.
- _is_reddit_post_url now matches host exactly (== 'reddit.com' or
  endswith '.reddit.com', same for redd.it). The previous substring
  check let 'evilreddit.com' and 'reddit.com.evil.example' through.
  Tests added for the lookalike rejection plus a positive case for
  legitimate subdomains like old.reddit.com.
- build_reddit logs a WARNING when REDDIT_USER_AGENT is unset, before
  falling back to a generic default. Reddit's API rules ask for a
  username-bearing UA; the warning surfaces the misconfiguration that
  would otherwise just look like flaky rate limits.
- Reddit fetch errors now go through _redact_spec, which strips query
  strings and caps length to 80 chars. Prevents accidentally pasting a
  URL with ?token=... into the field and seeing it echoed back through
  exception messages and Streamlit logs.
- README.md adds a 'Sharing scraped results' section warning that
  comments are written to Markdown unescaped — fine for personal
  viewing, but raw output/ should not be republished without a
  sanitizer because of Markdown link injection vectors.
- 19 new tests (94 total): _safe_filename behavior, _file_stem path
  traversal, write_item_* containment, _is_reddit_post_url lookalike
  rejection + subdomain acceptance, _redact_spec behavior, and
  build_reddit's logging assertion via patch.dict on sys.modules.
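
The exact-match host check, as a sketch (a later commit extracts it into the shared _is_reddit_host helper):

  from urllib.parse import urlparse

  _REDDIT_HOSTS = ("reddit.com", "redd.it")

  def _is_reddit_host(host: str) -> bool:
      host = (host or "").lower()
      return any(host == h or host.endswith("." + h) for h in _REDDIT_HOSTS)

  def _is_reddit_post_url(url: str) -> bool:
      parsed = urlparse(url)
      return parsed.scheme in ("http", "https") and _is_reddit_host(parsed.hostname or "")

  # old.reddit.com passes; evilreddit.com and reddit.com.evil.example do not.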

https://claude.ai/code/session_01XhN8Fp3HF1K4PzwS3oofta
…ck, host symmetry

Should-fix items from the second review pass:

- _file_stem now appends a short sha256 prefix when item_id sanitizes to
  the fallback ('item'), so two items whose ids both reduce to special
  chars no longer clobber each other on disk.
- _redact_spec also strips URL fragments (#access_token=...) in addition
  to query strings, since OAuth implicit-flow tokens travel there.
- build_reddit now treats whitespace-only REDDIT_USER_AGENT as missing
  and falls back to the default with the WARNING log, instead of
  silently passing whitespace through to PRAW.
- _normalize_subreddit and _normalize_user reject non-Reddit hosts when
  given a URL, mirroring _is_reddit_post_url. Cosmetic — PRAW would
  still hit api.reddit.com — but keeps validation symmetric.

Nice-to-haves while we're here:

- replace_more on expand_more=True is now hard-capped at 32 expansions
  (constant _MAX_REPLACE_MORE) instead of unbounded. Unbounded calls
  could pull thousands of comments and minutes of latency on big threads.
- 'rising' listing on a user (PRAW doesn't expose it) falls back to
  'new' with an INFO log so the user sees why the result differs.
- _is_reddit_host extracted as a shared helper used by all three URL
  validators.

8 new tests (102 total) cover stem collision avoidance, fragment
redaction, whitespace UA fallback, non-reddit host rejection in both
normalizers, replace_more cap, and the rising→new log.

https://claude.ai/code/session_01XhN8Fp3HF1K4PzwS3oofta
codeby changed the title from "Modular content parser: YouTube + Instagram plugins" to "Modular content parser: YouTube + Instagram + Reddit" on Apr 29, 2026
claude added 11 commits April 29, 2026 09:41
Three input kinds:
- query: groups.search → wall.get for each found community
- community: screen_name / club<id> / numeric / vk.com URL
- post_url: vk.com/wall<owner>_<post>

Settings cover the whole pipeline: max communities per query, max posts
per wall (capped at VK's 100/call), fetch_comments toggle, max comments
per post (paginated via wall.getComments offsets), and comment_depth
top_level vs all (with thread_items_count=10 when 'all').

The adapter resolves author names via the profiles + groups arrays
returned by extended=1 calls — no extra users.get / groups.getById
roundtrips. Negative owner_ids correctly map to club<id>; positive ones
to id<id>.

Security carry-overs from the previous reviews:
- VKClient sends access_token in the POST body, never query string.
- VK error_code 5/17/27/28 → AuthError; 6/9/29 → RateLimitError; rest
  → PluginError. UI surfaces these distinctly.
- _normalize_community and _extract_wall_id reject non-VK hosts (vk.com,
  vk.ru, m.vk.com only — substring match would let evilvk.com through).
- _normalize_community rejects VK reserved paths (feed, im, video, etc.)
  that would otherwise look like screen names but aren't communities.
- _redact_spec strips ?query and #fragment before logging.

47 new tests (149 total): adapter field mapping for posts/comments and
user vs group label resolution, normalization (screen_name / club /
URL / lookalike host / reserved path), wall ID extraction, _redact_spec,
client error code mapping, token-not-in-URL invariant, and fetch
dispatch for query/community/post including dedupe across mixed inputs.
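
The client boils down to POSTing to api.vk.com with the token in the body; a sketch (API version string and error handling simplified):

  import requests

  def vk_call(session: requests.Session, method: str, token: str, **params) -> dict:
      params.update({"access_token": token, "v": "5.131"})   # token in the POST body, never the URL
      resp = session.post(f"https://api.vk.com/method/{method}", data=params, timeout=30)
      resp.raise_for_status()
      payload = resp.json()
      if "error" in payload:
          raise RuntimeError(f"VK error {payload['error'].get('error_code')}")
      return payload["response"]

  with requests.Session() as session:
      wall = vk_call(session, "wall.get", token="...", domain="some_community",
                     count=100, extended=1)
      # extended=1 returns profiles/groups alongside items, so author names resolve without extra calls
      posts = wall["items"]
      profiles, groups = wall.get("profiles", []), wall.get("groups", [])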

https://claude.ai/code/session_01XhN8Fp3HF1K4PzwS3oofta
…apter

Should-fix items from the combined review:

- _fetch_comments now checks the cap *before* every append (top-level
  AND reply), so depth=all on a thread with hundreds of replies no
  longer overshoots max_comments by one. Also short-circuits pagination
  using the response's `count` field instead of doing one extra
  round-trip just to see an empty page.
- VKClient retries RateLimitError (codes 6/9/29) with exponential
  backoff (1s, 2s, 4s, ... up to max_rate_limit_retries=3 by default)
  before bubbling up. AuthError and other PluginErrors are not retried.
  _sleep is a static method so tests can patch it without timing flakes.
- VKClient now uses a single requests.Session for the whole client
  lifetime, so we don't pay the TLS handshake on every API call.
- post_to_item raises ValueError when owner_id or id is missing,
  instead of silently constructing item_id="0_0" which would collide
  across multiple malformed posts.
- _collect_for_spec post-path no longer duplicates the
  group/profile-cache lookup that the adapter already does via
  _label_for_id; just appends (post, None) and lets the adapter resolve.
  Extracted the shared response-merging logic into _extract_extended.

9 new tests (158 total): retry-then-succeed, give-up-after-max-retries,
auth-not-retried, top-level cap exact, depth=all overflow control,
single-page short-circuit on count, multi-page pagination continues
when count > page, adapter ValueError on missing fields. The earlier
ClientErrorMappingTest cases were updated to patch requests.Session
(not requests.post) since the client now uses a session.

https://claude.ai/code/session_01XhN8Fp3HF1K4PzwS3oofta
Two input kinds:
- channel: @username, plain username, or t.me URL (parses recent messages)
- post_url: t.me/<channel>/<msg_id> for a specific post + its comments

Reuses APIFY_API_TOKEN from the Instagram plugin so a Streamlit Cloud
user only configures the Apify secret once. The default actor is
apify/telegram-channel-scraper but actor_id is exposed as a setting
so it can be swapped (e.g. 73code/telegram-scraper) without code edits.

The adapter is field-shape-defensive because different Telegram
scrapers on Apify use different key names: _pick walks a list of
likely keys, _reactions_total accepts a list of {emoji, count} dicts,
a flat {emoji: count} mapping, or just an int. Comments embedded in
the message dict (replies_data, comments, discussion, thread.items)
all parse to the same Comment list.

Security carry-overs from the prior reviews:
- _is_tg_host does exact-match on t.me / telegram.me to reject
  evilt.me and t.me.evil.example
- _normalize_channel rejects Telegram reserved paths (joinchat, proxy,
  iv, etc.) that would otherwise look like usernames
- _extract_post_url rejects /c/<chatid>/ private-channel paths since
  the public scrapers cannot read them
- _redact_spec strips ?query and #fragment before logging
- post-fetch comment count is capped to max_comments_per_post even
  when the actor returns more

49 new tests (207 total): _pick fallback chain, _reactions_total over
all three reaction shapes, message_to_item with primary and alt field
names, zero-views preserved, inline-comment extraction, alternative
field-name fallbacks, host validation lookalike rejection, reserved
path rejection, /c/ private path rejection, dispatch one-actor-call vs
two for mixed inputs, actor_id override, dedupe across channel+post,
and comment cap enforcement.
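
The defensive field access amounts to a couple of tiny helpers; a sketch:

  def _pick(msg: dict, *keys, default=None):
      # Return the first present, non-None value among several likely key spellings.
      for key in keys:
          if key in msg and msg[key] is not None:
              return msg[key]
      return default

  def _reactions_total(raw) -> int | None:
      # Actors emit reactions as a list of {emoji, count} dicts, a flat mapping, or a bare int.
      if isinstance(raw, int):
          return raw
      if isinstance(raw, dict):
          return sum(v for v in raw.values() if isinstance(v, int))
      if isinstance(raw, list):
          return sum(item.get("count", 0) for item in raw if isinstance(item, dict))
      return None

  msg = {"viewsCount": 1200, "reactions": [{"emoji": "👍", "count": 7}]}
  _pick(msg, "views", "viewsCount", "view_count")   # -> 1200
  _reactions_total(msg.get("reactions"))            # -> 7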

https://claude.ai/code/session_01XhN8Fp3HF1K4PzwS3oofta
…nt, reply tree

Should-fix items from the combined review:

- actor_id is validated against ^[A-Za-z0-9_-]+[/~][A-Za-z0-9_.-]+$.
  Whitespace-only or unset falls back to the default actor cleanly;
  garbage like 'noslash' or '/missing' raises PluginError up-front
  instead of being sent to Apify and producing a confusing 404.
- fetch() does ONE pass through actor results: parses each message
  to Item, dedupes by item_id in the same loop. The previous version
  parsed each message twice (once for dedupe key, once for yield) —
  doubles the adapter cost on big result sets.
- _replies_count handles both shapes: 'replies: 42' (number) and
  'replies: [...]' (list of comment dicts → use len). Previously
  number-only responses left media.comments_count as None.
- _extract_comments now also looks at the bare 'replies' field for
  comment lists (not just replies_data/comments/discussion/thread).
- Reply tree linkage: when a comment has reply_to_message_id (or
  replyToMessageId / reply_to_msg_id) and the parent is in the same
  fetched batch, we set parent_id accordingly so the Markdown writer
  can render the thread structure. Out-of-batch references stay top-level.
- _is_private_channel_url helper catches t.me/c/<chat_id>/... before
  _extract_post_url returns None, raising an explicit PluginError that
  tells the user the URL is private and Apify scrapers can't read it.
- _to_int defensively coerces numeric values, refusing to silently
  store a stray dict (e.g. {'count': 100}) in media when an actor
  uses an unexpected schema. Applied to views/forwards counts.
- Cosmetic: media_obj computed once instead of msg.get('media') twice.

25 new tests (232 total): _to_int across all input shapes including
the dict-leak guard, _replies_count for int/list/alt-keys,
reply_to_message_id parent linkage with both inside and outside-batch
references, dict-views does-not-leak, actor_id validation across
five garbage forms plus default fallback for empty/whitespace,
ApifyError → PluginError wrapping for both channels and posts paths,
private /c/ URL explicit error.

Also fixes a regression introduced in the previous edit pass where
_channel_label lost its def line and became a continuation of
_replies_count's body — caught by the test suite immediately.

https://claude.ai/code/session_01XhN8Fp3HF1K4PzwS3oofta
Pull values from a column or range of any Google Sheets spreadsheet and
drop them into the active plugin's input tab. Same loader code will be
called from the cron runner in the next step, so it's designed for both
interactive and headless use.

Auth uses a Google Cloud service account: paste the JSON key into the
GOOGLE_SHEETS_CREDENTIALS secret, then share each target spreadsheet
with the service account's email (visible via .service_account_email()
helper for UX hints).

Loader API (content_parser/loaders/gsheets.py):
  loader = GoogleSheetsLoader.from_secrets({"GOOGLE_SHEETS_CREDENTIALS": ...})
  loaded = loader.load(sheet_id_or_url, tab="Communities", range_a1="A:A",
                       skip_header=False)
  loaded.values  # ['durov_says', 'telegram', ...] — flattened, deduped, trimmed
  loaded.sheet_title / loaded.tab_title / loaded.count

Sidebar block "📥 Загрузить из Google Sheets" exposes the same loader
under any plugin: paste creds, paste sheet URL, pick tab + range, pick
which input kind (channel / community / hashtag / etc.) to populate, hit
Загрузить. Loaded values append to the existing input field (preserving
manual entries), so several sheets can be merged before running.

Defensive behavior:
- credentials JSON is validated for type/client_email/private_key keys
  before sending to gspread, with a clear AuthError if it's e.g. an
  OAuth client JSON instead of a service account key.
- Sheet URL extraction tolerates the ID alone, the full /d/<id>/edit URL,
  and trailing query params.
- A1 range validated against a permissive regex; an actual range error
  from the API surfaces with the user's range echoed back.
- 403 from Google → AuthError with "share the sheet" hint. 404 →
  PluginError with "check the URL/ID".
- Unknown tab name → PluginError listing the tab names that DO exist.

20 new tests (252 total): credentials validation across all four
malformed forms (non-JSON string, JSON-but-not-dict, missing field,
missing secret), sheet ID extraction (bare ID / full URL / URL with
query / garbage / empty), load() with single column / multi column /
deduplication / blank-skipping / skip_header / invalid range / unknown
tab / 403 / 404 / default-first-sheet.

requirements.txt: +gspread>=6.0, +google-auth>=2.20 (the latter was
already a transitive dep of google-api-python-client; pinning it
explicitly makes the loader self-contained for cron use later).
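
Under the hood the loader is a thin layer over gspread; a sketch of the happy path (the real code validates the docs.google.com host strictly and maps errors to AuthError/PluginError):

  import json
  import re
  import gspread

  def load_column(creds_json: str, sheet_ref: str, tab: str = "", range_a1: str = "A:A") -> list[str]:
      info = json.loads(creds_json)                       # the service-account key from the secret
      client = gspread.service_account_from_dict(info)
      match = re.search(r"/d/([A-Za-z0-9_-]+)", sheet_ref)
      sheet_id = match.group(1) if match else sheet_ref   # bare ID or /d/<id>/edit URL
      spreadsheet = client.open_by_key(sheet_id)
      worksheet = spreadsheet.worksheet(tab) if tab.strip() else spreadsheet.sheet1
      seen, values = set(), []
      for row in worksheet.get(range_a1):                 # list of rows, each a list of cells
          for cell in row:
              cell = cell.strip()
              if cell and cell not in seen:               # flatten, trim, dedupe
                  seen.add(cell)
                  values.append(cell)
      return values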

https://claude.ai/code/session_01XhN8Fp3HF1K4PzwS3oofta
Should-fix items:

- _extract_sheet_id now validates the URL host strictly (must be
  docs.google.com). The previous regex.search would happily pull
  '/d/<id>/' out of any URL, including https://evil.com/.../d/<id>/...
  Not an SSRF (we don't fetch the user URL — the ID just becomes a
  parameter to the Google Sheets API), but the silent acceptance was
  misleading. Lookalike hosts and other google subdomains
  (mail.google.com etc.) are now rejected explicitly.
- validate_credentials extracted as a static method that does the
  shape check WITHOUT building a gspread client. The save button now
  validates pasted JSON via this helper before persisting, so users
  see "JSON невалиден: …" immediately instead of saving garbage that
  fails on next load.
- Service-account 'type' field is now checked too: an OAuth client
  JSON (type=authorized_user) is rejected with a message that points
  the user to the right kind of credential.
- All UI buttons in this block translated to Russian (Сохранить /
  Удалить / ✏️ Заменить / ✕ Отмена) — was English-Russian mixed.

Nice-to-haves while we're here:

- After creds are saved, the field collapses to a one-line summary:
  "✓ Учётка сохранена: bot@project.iam.gserviceaccount.com" with
  a hint to share the spreadsheet with that email — addresses both
  the "where do I find this?" UX gap and the security concern of
  re-rendering the full RSA private key in plain text on every load.
  An ✏️ Заменить button reveals the textarea again.
- A warning caption above the JSON field reminds the user that the
  JSON contains a private key.
- st.spinner around the load call so the UI shows progress feedback.
- Empty / whitespace 'tab' parameter falls back to the first sheet
  (matters for cron configs that may pass tab="").
- raw_rows dropped from LoadedRange — was populated but never read,
  carried unnecessary copies of full sheet data in memory.

8 new tests (260 total): non-google host rejected (with explicit
docs.google.com hint in error), lookalike host rejected, other-google
subdomain rejected (mail.google.com), OAuth client JSON rejected,
validate_credentials does NOT call _build_client, empty/whitespace
tab fallback to first sheet, raw_rows attribute removed from LoadedRange.

https://claude.ai/code/session_01XhN8Fp3HF1K4PzwS3oofta
Job describes one scheduled run: source plugin, inputs (inline list and/or
Google Sheets references), settings, optional cron schedule. Job-files
live in ~/.content_parser/jobs/<name>.yaml and are read with yaml.safe_load
to keep the door closed on !!python/object construction tricks.

Schema (jobs/schema.py):
- Job dataclass with validate(): rejects bad names (regex
  ^[A-Za-z0-9_-]{1,64}$), missing source, invalid cron expressions, jobs
  without any inputs, malformed sheet_inputs, unknown notify_on_failure.
- SheetInput dataclass mirrors GoogleSheetsLoader.load() args.
- is_valid_cron loosely accepts standard 5-token expressions and @-aliases
  (@daily, @Weekly, ...). It refuses garbage like 'rm -rf /' that contains
  characters outside [\d*/,-A-Za-z].
- resolved_output_dir() returns output/scheduled/<name>/<timestamp>/ by
  default, an absolute output_dir as-is, or a relative one resolved
  against cwd. The timestamp suffix is always appended.
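
A job file of that shape might look like the following, loaded with yaml.safe_load (the keys inside sheet_inputs are illustrative of the SheetInput fields, not guaranteed names):

  import yaml

  JOB_YAML = """
  name: weekly-python
  source: reddit
  schedule: "0 6 * * MON"
  inputs:
    subreddit: [python, learnpython]    # values per kind must be lists
  sheet_inputs:
    - sheet: https://docs.google.com/spreadsheets/d/<id>/edit
      tab: Communities
      range_a1: A:A
      input_kind: subreddit
  settings:
    listing: top
    time_filter: week
  notify_on_failure: none
  """

  job_data = yaml.safe_load(JOB_YAML)   # safe_load keeps !!python/object tricks out
  assert isinstance(job_data["inputs"]["subreddit"], list)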

Store (jobs/store.py):
- list_jobs() / load_job() / save_job() / delete_job() / job_exists().
- Path resolution validates the candidate is inside JOBS_DIR via
  Path.resolve() + relative_to() — defense in depth even though the
  job-name regex already keeps slashes out.
- list_invalid() returns (name, error) pairs for files that fail to
  parse, so the UI can surface broken jobs instead of silently dropping.
- save_job sets chmod 600 (best-effort).

Runner (jobs/runner.py):
- run_job(name) loads the YAML and runs run_job_obj(job).
- _resolve_inputs merges inline values with Sheets-loaded values per
  input kind, then dedupes preserving insertion order, then drops empty
  kinds.
- _collect_secrets pulls plugin secret_keys + GOOGLE_SHEETS_CREDENTIALS
  if any sheet_inputs present + the same WEBSHARE_/PROXY_ optional set
  the CLI/UI uses.
- On success: writes .last_run.txt marker. On failure: writes
  last_error.txt with traceback unless notify_on_failure='none'. The
  original exception is re-raised so cron sees a non-zero exit.

48 new tests (308 total): cron expression validation across standard
and alias forms (and rejection of cmd-injection-shaped garbage), Job
validation across every guard (bad name / no source / no inputs /
invalid schedule / malformed sheet ref / unknown notify), YAML
round-trip + safe_load enforcement (rejects !!python/object), name_hint
fallback when YAML omits 'name', range vs range_a1 alt key,
resolved_output_dir for default/relative/absolute, store CRUD with
path-traversal rejection, list_jobs sorting + skip-invalid behavior,
chmod 600 on save, runner input merge with dedupe, secret collection
(plugin / sheets-needed / optional proxy), success/failure marker
writing, notify_on_failure=none suppresses error file, empty resolved
inputs raise PluginError before plugin is touched.

requirements.txt: +pyyaml>=6.0.

https://claude.ai/code/session_01XhN8Fp3HF1K4PzwS3oofta
cron.py manages a marker-bounded block in the user's crontab without
ever touching lines outside our markers:

  # >>> content_parser jobs >>>
  0 6 * * MON cd /repo && python -m content_parser.cli jobs run weekly  # job:weekly
  # <<< content_parser jobs <<<

API:
- install_cron(jobs=None, project_root=None, python_executable=None,
  log_path=None) — collects every job with a schedule, regenerates the
  managed block. Idempotent: running twice with the same jobs yields the
  same crontab. Existing user lines outside the markers are preserved.
- remove_cron() — strips the block, returns True/False.
- read_block() — best-effort parse of currently-installed entries
  (schedule, job_name, command).

Safety:
- shlex.quote on every path/argument that goes into the cron command,
  so even a hypothetical bad job name (which the schema regex already
  rejects) couldn't inject extra shell metacharacters.
- Friendly errors for missing crontab binary and 'no crontab' state.

CLI subcommand `jobs`:
- jobs list           → tabulated overview of all saved jobs + invalid files
- jobs show <name>    → dump a job's canonical YAML
- jobs run <name>     → invoke run_job() with stdout logging and progress
- jobs install-cron   → regenerate the managed block
- jobs remove-cron    → strip the managed block
- jobs cron-status    → show what's currently in the block

18 new tests (326 total): _strip_block leaves outside lines untouched
and handles block-at-start, _build_block produces marker-wrapped lines
with # job:<name> footer, build_command_for_job shell-quotes paths with
spaces and uses safe paths as-is, install_cron idempotency across runs,
jobs without schedule are skipped, lines outside markers preserved
through reinstall, remove_cron only writes when block exists,
read_block parses entries back, _existing_crontab returns "" for the
"no crontab for user" case but raises on real errors and on missing
binary.

https://claude.ai/code/session_01XhN8Fp3HF1K4PzwS3oofta
New '🕐 Расписание' section appears at the bottom of every plugin's page.
Lets the user:

1. List existing jobs with collapsed details: source, schedule (or "ручной
   запуск" badge), description, inline inputs and Sheet refs summary.
2. Per job: ▶️ Запустить (calls run_job_obj with live log), ✏️ Изменить
   (raw YAML editor with Save/Cancel), 🗑️ Удалить.
3. ➕ Создать job из текущего состояния — captures the current input
   tabs + plugin settings into a new YAML file. Bare-minimum form: name,
   optional cron, optional description; sheet_inputs added by editing
   the YAML afterward (since they need URL/tab/range fields).
4. 📅 Cron section, automatically grayed out on hosts without `crontab`
   binary (Streamlit Cloud) — there it shows a copy-paste GitHub Actions
   workflow as the alternative path. On hosts with crontab: install /
   remove buttons + summary of currently-installed entries.

UI gracefully surfaces invalid YAML files via list_invalid() so a user
who hand-edited a file and broke it can see the parse error instead of
having the job silently disappear.

is_cron_available() helper added to jobs/cron.py — runs a one-shot
`crontab -l` and catches FileNotFoundError. UI calls it once per render
to decide whether to show the install/remove buttons or the GH Actions
template.

Run button label updated to "▶️ Запустить (разово)" to disambiguate
from the per-job ▶️ buttons in the Schedule panel.

https://claude.ai/code/session_01XhN8Fp3HF1K4PzwS3oofta
…friendly CLI

Should-fix items:

- inputs YAML parser now refuses non-list values per kind. The previous
  comprehension iterated strings character-by-character, so the typo
  'community: durov_says' (no brackets) silently produced ['d','u','r',...].
  Fix raises a clear PluginError before the typo can corrupt a run.
- Job.validate() now rejects '..' anywhere in output_dir parts. Absolute
  paths still go through (user explicitly opts in), but the path-traversal
  case '../../etc' or 'custom/../escape' is caught at validation.
- build_command_for_job rejects newlines and carriage returns in any path
  fragment (project_root, python_executable, log_path) and in job.name.
  shlex.quote happily preserves a literal \n inside its single-quoted
  output, which would split a crontab entry across two lines and corrupt
  the file. The schema's job-name regex already covers job.name, but the
  defense is added there too for future-proofing.
- cli jobs run wraps run_job in try/except for AuthError, PluginError and
  KeyError (unknown source from get_plugin), printing a friendly stderr
  message and returning exit code 1 instead of dumping a Python traceback.
- run_job_obj now computes resolved_output_dir() ONCE up-front. Earlier,
  a Sheets-load failure or empty-resolved-inputs would call
  job.resolved_output_dir() twice — once for the eventual run, again to
  pick a place for last_error.txt — producing two timestamped directories
  that differ by milliseconds. Now both markers land in the same dir.

12 new tests (338 total): output_dir rejected with .. at start / middle,
absolute and normal-relative output_dirs accepted, string / int / dict
values in inputs raise on YAML load (with the "must be a list" hint),
empty input value treated as empty list, newline rejected in
project_root / python_executable / log_path, carriage return rejected.
The empty-resolved-inputs test in test_jobs_runner already verified the
single-out_dir behavior end-to-end (passes with the refactor).

https://claude.ai/code/session_01XhN8Fp3HF1K4PzwS3oofta
New content_parser/transcription/ module wires yt-dlp + OpenAI Whisper into
the existing Item.transcript field. When the user enables 'transcribe_videos'
in a plugin's settings, each Item with a video URL is downloaded as audio
(MP3 64 kbps, well under the 25 MB Whisper API limit), shipped to
api.openai.com/v1/audio/transcriptions, and the verbose_json segments are
mapped onto the existing Transcript schema so the Markdown writer renders
them the same way as YouTube subtitles.

Module layout:
- downloader.py — yt-dlp wrapper with FFmpegExtractAudio postprocessor and
  a 24 MB filesize cap. get_duration_seconds() probes without downloading
  for budget gating.
- whisper_api.py — minimal Bearer-auth HTTP client (just `requests`,
  no `openai` package). Distinguishes 401 (bad key), 429 (rate limit),
  and other 4xx/5xx with the API's error message.
- cache.py — ~/.content_parser/transcription_cache/<source>_<id>.json,
  so re-running a job doesn't re-pay for previously transcribed items.
- runner.py — maybe_transcribe(item, settings, secrets, only_if_missing=)
  is the single entry point plugins call. Order: cache check →
  duration cap → download → API → cache write.
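
The Whisper client is just a multipart POST with Bearer auth; a sketch (error mapping and retries simplified):

  import requests

  def transcribe_audio(path: str, api_key: str, language: str | None = None) -> dict:
      data = {"model": "whisper-1", "response_format": "verbose_json"}
      if language:
          data["language"] = language
      with open(path, "rb") as audio:
          resp = requests.post(
              "https://api.openai.com/v1/audio/transcriptions",
              headers={"Authorization": f"Bearer {api_key}"},   # token in the header, never the URL
              data=data,
              files={"file": audio},
              timeout=600,
          )
      if resp.status_code == 401:
          raise RuntimeError("OpenAI API key rejected (401)")
      resp.raise_for_status()
      payload = resp.json()
      # verbose_json carries the full text plus per-segment start/end timestamps
      return {"text": payload.get("text", ""), "segments": payload.get("segments", [])}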

Plugin integration:
- Instagram, VK, Telegram add `transcribe_videos` (bool, default off) and
  `max_audio_seconds_per_video` (default 600) FieldSpecs and call
  maybe_transcribe inline in fetch().
- YouTube treats Whisper as a fallback: only_if_missing=True means it
  runs only when youtube-transcript-api couldn't return segments (subs
  disabled, blocked, etc.). Avoids wasting API on videos that already
  have free subtitles.

UI:
- Sidebar shows an inline 'Параметры Whisper' expander when
  transcribe_videos is checked, with OPENAI_API_KEY input + save/clear
  buttons + caption about the cost and ffmpeg requirement.
- OPENAI_API_KEY is in the optional shared-secrets list, so a saved
  value is picked up across plugins and by the cron runner.

Security carry-overs:
- Token in Authorization: Bearer header, never URL.
- _video_url_for prefers the canonical post URL (e.g. instagram.com/reel/AAA/)
  over CDN URLs in media.video_url, since CDN tokens often expire while
  yt-dlp can re-resolve from the post URL fresh.
- Cache filenames go through _safe regex so a malicious upstream id like
  '../../etc' can't escape the cache dir.
- Hard cap on audio duration before download blocks surprise costs.

24 new tests (362 total): cache CRUD with path-traversal sanitization;
whisper_api Bearer header / verbose_json format / language passthrough /
401 / 429 / other-error message extraction / valid response parsing;
maybe_transcribe disabled-by-setting / no-key-sets-error / cache-hit-
skips-network / full-pipeline-downloads-and-caches / duration-cap-blocks
/ download-failure-recorded / whisper-failure-recorded /
only_if_missing-skips-when-present / only_if_missing-runs-when-empty /
no-video-url-silent / prefers-canonical-url-over-cdn.

requirements.txt: +yt-dlp>=2024.0. ffmpeg required at runtime
(documented in plugin help text).

https://claude.ai/code/session_01XhN8Fp3HF1K4PzwS3oofta
claude added 5 commits April 29, 2026 11:40
…ion pin

Should-fix items:

- runner.maybe_transcribe now refuses URLs that aren't public HTTP(S):
  loopback hostnames (localhost / 0.0.0.0), IPv4/IPv6 literals in private
  RFC1918 ranges, link-local (169.254.0.0/16 incl. AWS metadata), reserved
  and loopback. yt-dlp would otherwise happily fetch from internal
  networks if any third-party API (Apify/VK/Telegram actor) ever returned
  such a URL — chain-of-trust SSRF. Bare DNS names still pass since
  resolution happens later in yt-dlp; this layer only catches literals.
- runner.maybe_transcribe blocks transcription when get_duration_seconds()
  returns None. Without a known length the per-video Whisper bill is
  unbounded; refusing is the cheap-and-safe default. Earlier code fell
  through this branch and would download anyway.
- whisper_api.transcribe_audio retries 429 (rate limit) and 5xx (server
  error) up to max_retries=2 with exponential backoff (2s, 4s). 401/4xx
  other than those surface immediately. _sleep is a module-level helper
  so tests patch it without slowing the suite — TranscribeAudioTest's
  test_429_rate_limit was updated to use max_retries=0 for the no-retry
  semantic.
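
A sketch of the literal-address check (stdlib ipaddress; bare DNS names intentionally pass through):

  import ipaddress
  from urllib.parse import urlparse

  def _is_public_url(url: str) -> bool:
      parsed = urlparse(url or "")
      if parsed.scheme not in ("http", "https") or not parsed.hostname:
          return False
      if parsed.hostname == "localhost":
          return False
      try:
          ip = ipaddress.ip_address(parsed.hostname)
      except ValueError:
          return True   # not an IP literal; DNS resolution happens later inside yt-dlp
      return not (ip.is_private or ip.is_loopback or ip.is_link_local
                  or ip.is_reserved or ip.is_unspecified)

  # 169.254.169.254 (cloud metadata), 10.0.0.1, ::1 and 0.0.0.0 all return False.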

Nice-to-haves:

- yt-dlp pinned to >=2024.0,<2027.0 to bound supply-chain blast radius
  if a future major version ever ships a malicious extractor.
- UI caption under Параметры Whisper now mentions that the saved key
  persists across checkbox toggles — only 🗑️ removes it.

17 new tests (379 total): _is_public_url across normal URLs, http variant,
non-http schemes, localhost / 0.0.0.0 / 127.0.0.1 / IPv6 ::1, RFC1918
(10/172.16/192.168), link-local 169.254 (AWS metadata), IPv6 fc00::/7,
empty/invalid input, DNS names pass through; runner blocks on private
URL before any download; runner blocks on duration unknown; Whisper
retry on 429-then-success / 500+503-then-success / exhausted retries;
401 and 400 do NOT retry (single call only). Existing test_429_rate_limit
adjusted for new retry semantics.

https://claude.ai/code/session_01XhN8Fp3HF1K4PzwS3oofta
Three cross-cutting issues from the full-project review, in one batch:

1. GitHub Actions tests workflow (.github/workflows/tests.yml). Runs
   on every push and PR-to-main against Python 3.11 and 3.12, installs
   requirements.txt, runs `python -m unittest discover -s tests -v`,
   and smoke-checks `cli list-sources`. No more silent regressions
   between manual reviews.

2. _redact_spec was reimplemented in three plugins (Reddit, VK,
   Telegram), each with the security-relevant job of stripping ?query
   and #fragment from URLs before they hit logs or exception messages.
   When we added fragment-stripping to Reddit, the others were missed
   for a release. Extracted to content_parser/core/redact.py as
   redact_spec(). All three plugins now import the single canonical
   implementation; tests import from core.redact too (aliased to
   _redact_spec locally to keep diffs small).

3. ApifyClient lived in plugins/instagram/apify_client.py and Telegram
   imported from there — runtime cross-plugin dependency that would
   silently break if Instagram were renamed or removed. Moved to
   content_parser/clients/apify.py (a new top-level package for shared
   HTTP clients). Both Instagram and Telegram now import from the
   shared module; Instagram's old file is deleted. Tests adjusted to
   patch the new module path (content_parser.clients.apify.requests.post).

All 379 tests still pass after the moves.
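
The shared helper is small; roughly (truncation length per the Reddit original):

  from urllib.parse import urlsplit, urlunsplit

  def redact_spec(spec: str, max_len: int = 80) -> str:
      # Strip ?query and #fragment from URL-like specs before they reach logs or exception text.
      spec = (spec or "").strip()
      if "://" in spec:
          parts = urlsplit(spec)
          spec = urlunsplit((parts.scheme, parts.netloc, parts.path, "", ""))
      return spec[:max_len]

  redact_spec("https://vk.com/wall-1_2?access_token=SECRET#fragment")
  # -> 'https://vk.com/wall-1_2'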

https://claude.ai/code/session_01XhN8Fp3HF1K4PzwS3oofta
New 'instagram_graph' plugin alongside the existing public 'instagram'
(Apify) plugin. Different tool for different jobs:

  - 'instagram'        — public posts from any account (Apify, $$$)
  - 'instagram_graph'  — your own posts + insights (Meta Graph API, free)

What it gives that Apify can't:
  - Insights — reach, impressions, plays, saved, shares, total_interactions
    on your own Reels and posts
  - Full comment threads with replies and like counts
  - No per-item Apify cost
  - Stable, Meta-supported endpoint

Files:
  - plugins/instagram_graph/client.py — GraphClient over graph.facebook.com
    with retry-on-429/5xx exponential backoff (2s, 4s), pagination via
    paging.next URL walking, embedded-token replacement so a 'next' URL
    can't smuggle a different token through, error-code mapping
    (190/102/etc → AuthError; 10/200/803 → AuthError "permissions";
    4/17/32/613 → RateLimitError).
  - plugins/instagram_graph/adapter.py — media_to_item maps a Graph
    media object (IMAGE/VIDEO/REEL/CAROUSEL_ALBUM) to core.Item with
    insights flattened into media dict; flatten_comments folds inline
    replies (replies.data) into the flat Comment list with parent_id.
  - plugins/instagram_graph/plugin.py — InstagramGraphPlugin with two
    inputs: 'account' (Business Account ID, 15-20 digits regex-validated)
    and 'post_id' (numeric media ID). Settings: max_posts_per_account,
    fetch_comments / fetch_replies / fetch_insights toggles,
    max_comments_per_post, plus the standard transcribe_videos /
    max_audio_seconds_per_video pair. Whisper integration via the same
    maybe_transcribe call as other video plugins.
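
The pagination-with-token-override behaviour, sketched (Graph API version string illustrative; error mapping reduced to a bare raise):

  import requests
  from urllib.parse import parse_qs, urlsplit, urlunsplit

  def get_paginated(path: str, token: str, params: dict, max_items: int = 200) -> list[dict]:
      url, items = f"https://graph.facebook.com/v19.0/{path}", []
      while url and len(items) < max_items:
          resp = requests.get(url, params=dict(params, access_token=token), timeout=30)
          payload = resp.json()
          if "error" in payload:
              raise RuntimeError(payload["error"].get("message", "Graph API error"))
          items.extend(payload.get("data", []))
          next_url = payload.get("paging", {}).get("next")
          if not next_url:
              break
          # Rebuild the 'next' URL without its query so an embedded token can't ride along;
          # our own token is re-applied on the next iteration.
          parts = urlsplit(next_url)
          params = {k: v[0] for k, v in parse_qs(parts.query).items() if k != "access_token"}
          url = urlunsplit((parts.scheme, parts.netloc, parts.path, "", ""))
      return items[:max_items]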

Plumbing:
  - registry.py registers the new plugin alongside the existing five.
  - jobs/runner.py adds INSTAGRAM_ACCESS_TOKEN to the optional secrets
    list, so cron jobs pick it up automatically.
  - ui/app.py shared-secrets list extended too.

Auth requirements (documented in plugin.py docstring):
  Convert IG account to Business/Creator → connect to a FB Page →
  create Meta Developer App → generate long-lived token via Graph API
  Explorer with scopes instagram_basic, instagram_manage_comments,
  pages_show_list, business_management → store as INSTAGRAM_ACCESS_TOKEN.

Insights are best-effort: if the /insights call returns a permissions
error (common on archived posts or older media), we swallow it and
continue with the rest of the run instead of dying.

42 new tests (421 total): client (token always overrides embedded ones,
401/code-10/code-4/5xx error mapping, retry-on-429-then-success, retry
exhaustion, pagination across pages, max_items early-exit, embedded-
token override on next URL); adapter (insights envelope flattening
across dict/list shapes, media_to_item field mapping for REEL +
non-REEL, owner_username override, missing-id raises, falls back
gracefully when owner_username not passed, comment_to_core for top
and reply, flatten_comments two-level expansion); plugin (resolve
validates account-id length and post-id format, dedupe across inputs,
fetch dispatch for account-path / post-path / mixed-with-dedupe,
fetch_insights=False skips /insights endpoint entirely, insights
failure does NOT abort the run).

https://claude.ai/code/session_01XhN8Fp3HF1K4PzwS3oofta
Should-fix items:

- GraphClient now scrubs the access token from any RequestException
  message before raising. The `requests` library sometimes embeds the
  full URL — including ?access_token=… — in connection-error messages,
  which would otherwise propagate to last_error.txt / Streamlit logs /
  CLI stderr. The exception is re-raised with `from None` so the chained
  __cause__ doesn't keep the unredacted original around either.
- is_reel boolean now has explicit parens —
    (media_type == "REEL") or (media_type == "VIDEO" and product_type=="REELS")
  — instead of relying on Python's `and > or` precedence, which is easy
  to misread.
- media_to_item accepts insights as a keyword argument instead of having
  callers mutate `media["insights"]`. The plugin now passes the freshly-
  fetched insights data through; media dict stays read-only.
- Stale comment about a non-existent _get_url helper replaced with
  accurate description of what get_paginated actually does.

4 new tests (425 total): RequestException with the token in its message
gets [REDACTED] in the propagated PluginError; chained __cause__ is None
so the secret doesn't leak through traceback.format_exc; 5xx-then-5xx-
then-success retries with exponential backoff (mirrors the existing 429
test); insights metric selection differs for REEL vs IMAGE media types
(REEL gets plays+total_interactions, IMAGE gets impressions+reach+saved).

https://claude.ai/code/session_01XhN8Fp3HF1K4PzwS3oofta
…c cache, Cloud detection, status file

Six findings from the project-wide audit landed in one batch:

1. CSV formula injection guard (core/output.py). Excel/Sheets/LibreOffice
   execute any CSV cell starting with =, +, -, @, \t, \r as a formula
   (=cmd|'/c calc'!A1 is the canonical RCE proof-of-concept). User-
   controlled fields like title and author can come straight from
   Apify/Reddit/YouTube comments, which means any of our scrapes could
   ship a CSV that runs shell commands when a non-technical viewer opens
   it in Excel. _csv_safe prepends a single quote to neutralize the
   formula while keeping the value visible. Applied to every string
   column going into summary.csv (see the sketch after this list).

2. Token redaction in last_error.txt (jobs/runner.py). The previous
   implementation wrote `traceback.format_exc()` raw — and tracebacks
   carry the chained exception's message, which can include API URLs
   with ?access_token=… in the query (we redact at the source for
   Instagram Graph but not in every other plugin's exception path).
   _record_failure now scrubs every secret value it knows about (8+
   chars only, to skip noise) before the file lands on disk. Both
   call sites pass `secrets=secrets` from collect_secrets.

3. YouTube replies cap honored (plugins/youtube/comments.py). When
   include_replies=True, fetch_comments used to call _fetch_all_replies
   without bound — a single popular top-level comment with 500 replies
   would return 1+500 items only for `comments[:max_comments]` to throw
   most of them away. The fix threads `remaining = max_comments -
   len(comments)` through to _fetch_all_replies, which now stops both
   inside the inline loop and at page boundaries. Also requests page
   sizes proportional to remaining quota.

4. Atomic transcription cache (transcription/cache.py). put() now
   writes to <name>.json.tmp and renames over the final path. POSIX
   guarantees rename atomicity, so a crash during the write leaves
   either the old value or the new value, never a half-written JSON
   that get() catches as ValueError and silently treats as cache miss.

5. Streamlit Cloud detection in secrets layer (core/secrets.py).
   .streamlit/secrets.toml is managed by the Cloud dashboard and
   read-only at the filesystem level. Detect via STREAMLIT_RUNTIME
   env, STREAMLIT_SHARING, or HOSTNAME=streamlit-* and skip the file
   write entirely — local config.json (the other write target) still
   persists so the value works for the current container; users
   mirror it via Settings → Secrets for next deployment.

6. Unified .last_status.json (jobs/runner.py). Both _record_success
   and _record_failure now write a single canonical status file that
   monitoring / UI can stat once for "is this job healthy?". Schema:
   {job, source, status, finished_at, items, error}. Atomic write via
   .tmp+replace as well.
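
The guard itself is a one-liner; sketched with the injection-prone prefixes from the audit:

  _FORMULA_PREFIXES = ("=", "+", "-", "@", "\t", "\r")

  def _csv_safe(value):
      # Prefix a single quote so Excel/Sheets render the cell as text instead of a formula.
      if isinstance(value, str) and value.startswith(_FORMULA_PREFIXES):
          return "'" + value
      return value

  _csv_safe("=cmd|'/c calc'!A1")   # -> "'=cmd|'/c calc'!A1" (displayed, never executed)
  _csv_safe("normal title")        # -> unchanged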

17 new tests (442 total): _csv_safe across all five injection-prone
prefixes (= + - @ \t \r) and the safe-string / None / non-string
passthrough cases; an end-to-end summary.csv test that injects a
malicious title and verifies the round-tripped DictReader sees the
quoted form. record_failure-redaction (passes a secret value, expects
[REDACTED] in last_error.txt) and the new status-file shape. Cache
atomicity (no .tmp left after success, second put() replaces first).
Streamlit Cloud detection (STREAMLIT_RUNTIME=cloud → no file written;
empty env → file written normally).

https://claude.ai/code/session_01XhN8Fp3HF1K4PzwS3oofta